By Jongmin Lim BCB420
The purpose of this analysis is to identify the transcriptomic difference between iPSC and iPSC derived astrocyte. There are 5 patients of iPSC cell and corresponding iPSC derived astrocyte. After load raw data, data has been filtering based on feature a leat 1 read per million in n of the samples, n as the size of the smallest groups of replicates, which is 2 replicates. After filtering, trimm mean of m value, and log normalize the data, HUGO symboles mapped to the corresponding gene. Gene with NA as HUGO symboles removed. And later, all gene with no HUGO symboles is removed.
Limma-Trend and Negative Binomial Generalized LInear Models with Quasi-likelihood test from limma v3.40.6 and edgeR package v3.26.8 perfomred for differnetial gene expression analysis. 9977 genes met the thresdhol for statistical significance with P value less than 0.05. 8684 genes pass the correction. Differential genes, upregualted and downregualted, analyzed with thresholded overrepresentation analysis using g:profiler. The heatmap result shows the clusters of gene that differentiate between iPSC and iPSC derived astrocyte.
GSEA will be use as non-thresholded pathwasy analysis and use cytoscape with enrichment map pipiline to summarize tand visualize the results.
if (!requireNamespace("colorRamps", quietly=TRUE)) {
install.packages("colorRamps")
}
if (!requireNamespace("doBy", quietly=TRUE)) {
install.packages("doBy")
}
if (!requireNamespace("gprofiler2", quietly=TRUE)) {
install.packages("gprofiler2")
}
if (!requireNamespace("colorRamps", quietly=TRUE)) {
install.packages("colorRamps")
}
#install required R and bioconductor packages
tryCatch(expr = { library("RCurl")},
error = function(e) { install.packages("RCurl")},
finally = library("RCurl"))
#use library
tryCatch(expr = { library("limma")},
error = function(e) { source("https://bioconductor.org/biocLite.R")
biocLite("limma")},
finally = library("limma"))
tryCatch(expr = { library("Biobase")},
error = function(e) { source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")},
finally = library("Biobase"))
tryCatch(expr = { library("ggplot2")},
error = function(e) { install.packages("ggplot2")},
finally = library("ggplot2"))
#For creating json and communicating with cytoscape
tryCatch(expr = { library("httr")},
error = function(e) { install.packages("httr")},
finally = library("httr"))
tryCatch(expr = { library("RJSONIO")},
error = function(e) { install.packages("RJSONIO")},
finally = library("RJSONIO"))
#List of packages that need to be active for this project
library(GEOmetadb)
library(knitr)
library(tidyr)
library(biomaRt)
library(edgeR)
library(BiocGenerics)
library(ComplexHeatmap)
library(circlize)
library(colorRamps)
library(dplyr)
library(doBy)
library(gprofiler2)
library(kableExtra)
library(limma)
This dataset is RNAseq expression of iPSC and iPSC-dervied Astrocyte. The purpose of this database is to validate identity of iPSC-derived astrocyte by looking at expression RBA. This is important since they need to get correct cell type to proceed to experiment.
#GEO description of of my dataset
gse <- getGEO("GSE116124", GSEMatrix=FALSE)
kable(data.frame(head(Meta(gse))), format = "html")
| contact_address | contact_city | contact_country | contact_email | contact_institute | contact_laboratory |
|---|---|---|---|---|---|
| Avinguda Gran Via, 199-203 | Hospitalet de Llobregat (Barcelona) | Spain | meritxell.pons@ub.edu | Institute of Biomedicine of the University of Barcelona (IBUB) | Neural Commitment and Differentiation Lab |
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))
Platform Title: Illumina HiSeq 2000 (Homo sapiens)
Original submission date: Nov 02 2010
Last update date: Mar 27 2019
Organism: Homo sapiens
No. of GEO datasets that use this technology: 7897
No. of GEO samples that use this technology: 122103
#Asign the file
sfiles = getGEOSuppFiles('GSE116124')
fnames = rownames(sfiles)
cell_exp = read.delim(fnames[1], header=TRUE, check.names = FALSE)
#Load the dataset with html table format
kable(cell_exp[1:15,1:5], format = "html")
| id_gene,gene_name,gene_type | AD5607 | AD5609 | AD5610 | AD5611 |
|---|---|---|---|---|
| ENSG00000223972.5,DDX11L1,transcribed_unprocessed_pseudogene | 0 | 0 | 0 | 0 |
| ENSG00000227232.5,WASH7P,unprocessed_pseudogene | 356 | 288 | 110 | 130 |
| ENSG00000278267.1,MIR6859-1,miRNA | 0 | 0 | 0 | 0 |
| ENSG00000243485.5,MIR1302-2HG,lincRNA | 4 | 6 | 0 | 0 |
| ENSG00000284332.1,MIR1302-2,miRNA | 0 | 0 | 0 | 0 |
| ENSG00000237613.2,FAM138A,lincRNA | 0 | 0 | 0 | 0 |
| ENSG00000268020.3,AL627309.6,unprocessed_pseudogene | 0 | 0 | 0 | 0 |
| ENSG00000240361.2,OR4G11P,transcribed_unprocessed_pseudogene | 0 | 2 | 0 | 0 |
| ENSG00000186092.5,OR4F5,protein_coding | 0 | 1 | 0 | 0 |
| ENSG00000238009.6,AL627309.1,lincRNA | 10 | 16 | 28 | 17 |
| ENSG00000239945.1,AL627309.3,lincRNA | 6 | 0 | 26 | 10 |
| ENSG00000233750.3,CICP27,processed_pseudogene | 117 | 45 | 39 | 63 |
| ENSG00000268903.1,AL627309.7,processed_pseudogene | 575 | 590 | 19 | 26 |
| ENSG00000269981.1,AL627309.8,processed_pseudogene | 107 | 127 | 0 | 9 |
| ENSG00000239906.1,AL627309.2,antisense_RNA | 1 | 8 | 0 | 0 |
#Separate the first column in to 3 total based on the id, name,and type of gene
cell_exp <- separate(cell_exp, col = "id_gene,gene_name,gene_type", into = c("id_gene", "gene_name", "gene_type"), sep = "\\,")
#Assign the column names
colnames(cell_exp) <- list("ensembl_gene_id", "gene_name", "gene_type",
"iPSC.A", "iPSC.B",
"hASTRO.E", "hASTRO.E",
"hASTRO.A","hASTRO.B",
"hASTRO.C", "hASTRO.D",
"iPSC.C", "iPSC.D")
#Remove unessacry data for this comparison
cell_exp <- cell_exp[,-(6:7)]
#Need to remove decimal point so I can map identifiersd without version
cell_exp$ensembl_gene_id <- gsub(pattern = "\\.\\d+$", replacement = "", x = cell_exp$ensembl_gene_id, ignore.case = TRUE)
#Change to count per million
cpms = cpm(cell_exp[,4:11])
#Set the rowname of first colum of cell_exp dataset
rownames(cpms) <- cell_exp[,1]
#Remove 1 or lower read per million.
#Keep the replicate group that equal or greater than 2
keep = rowSums(cpms >1) >=2
cell_exp_filtered = cell_exp[keep,]
#Define the group
samples <- data.frame(
lapply(colnames(cell_exp)[4:ncol(cell_exp)],
FUN=function(x){
unlist(strsplit(x, split = "\\."))[c(1,2)]})) #Separate based on "."
#Separate columns based on cell type and patient
colnames(samples) <- colnames(cell_exp)[4:ncol(cell_exp)]
rownames(samples) <- c("cell_type", "patient")
samples <- data.frame(t(samples))
samples
## cell_type patient
## iPSC.A iPSC A
## iPSC.B iPSC B
## hASTRO.A hASTRO A
## hASTRO.B hASTRO B
## hASTRO.C hASTRO C
## hASTRO.D hASTRO D
## iPSC.C iPSC C
## iPSC.D iPSC D
#Convert data.frame to matrix
#Need to make sure that use filtered count and matrix
filtered_data_matrix <- as.matrix(cell_exp_filtered[, 4:11])
rownames(filtered_data_matrix) <- cell_exp_filtered$ensembl_gene_id#Add rownames
#Define group as cell type to compare
tmMreq = DGEList(counts = filtered_data_matrix, group = samples$cell_type)
#Calculate normalization factors
tmMreq = calcNormFactors(tmMreq)
#Get normalized data
normalized_counts <- cpm(tmMreq)
#Separate the area to load 2 graphs
par(mfrow= c(1,2))
#Not normalized boxplot
data2plot <- log2(cpm(cell_exp_filtered[,4:11]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Cell RNAseq Samples")
#draw the median of each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
#Normalized boxplot
data2plot <- log2(normalized_counts[,1:7])
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Normalized Cell RNAseq Samples")
#draw the median of each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
#Separate the area to load 2 graphs
par(mfrow= c(1,2))
#Not normalized data (left)
counts_density <- apply(log2(cpm(cell_exp_filtered[, 4:11])), 2, density)
#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)){
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main = "Cell RNAseq Samples", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border = "blue", text.col = "green4",
merge = TRUE, bg = "gray90")
#Normalized data (right)
normalized_density <- apply(log2(normalized_counts[, 1:7]), 2, density)
#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_density)){
xlim <- range(c(xlim, normalized_density[[i]]$x));
ylim <- range(c(ylim, normalized_density[[i]]$y))
}
cols <- rainbow(length(normalized_density))
ltys <- rep(1, length(normalized_density))
#plot the first density plot ot initialize the plot
plot(normalized_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of normalized log2-CPM", main = "Normalized Cell RNAseq Samples", cex.lab = 0.85)
#plot each line
for (i in 1:length(normalized_density)) lines(normalized_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border = "blue", text.col = "green4",
merge = TRUE, bg = "gray90")
#MDS plot to see the distance between samples
plotMDS(tmMreq, labels=rownames(samples),
main = "MDS Plot of Samples",
col = c("darkgreen", "darkblue")[factor(samples$cell_type)])
#Estimate common and tagwise dispersion
model_design <- model.matrix(~ samples$patient)
dispersion <- estimateDisp(tmMreq, model_design)
#Plot BCV with sample
plotBCV(dispersion, col.tagwise = "black", col.common = "red", main = "Biological Coefficient of Variation of Samples")
#Separate the area to load 2 graphs
par(mfrow= c(1,2))
plotMeanVar(dispersion,
show.raw.vars = TRUE,
show.tagwise.vars=TRUE,
NBline=TRUE,
show.ave.raw.vars = TRUE,
show.binned.common.disp.vars = TRUE,
main = "Mean-Variance Relationship of Samples")
#Connect to ensembl mart and limit to human datasets
#Using grch37 because I keep get error when using most updated ensembl
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl", host="grch37.ensembl.org")
## Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'RJSONIO'
## Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'RJSONIO'
#Filter
biomart_human_filters <- listFilters(ensembl) #Will go with ensembl_gene_id_version since my dataset contain version
#Attributes
kable(searchAttributes(mart = ensembl, 'hgnc') , format="html") #Will go with hgnc_symbol
| name | description | page | |
|---|---|---|---|
| 62 | hgnc_id | HGNC ID | feature_page |
| 63 | hgnc_symbol | HGNC symbol | feature_page |
| 64 | hgnc_trans_name | HGNC transcript name ID | feature_page |
#Check to see if cell_id_conversion file exists
conversion_stash <- "cell_id_conversion.rds"
if(file.exists(conversion_stash)){
cell_id_conversion <- readRDS(conversion_stash)
}else{
cell_id_conversion <- getBM(attributes =
c("ensembl_gene_id", "hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = factor(cell_exp_filtered$ensembl_gene_id), #Values is first colum of my expression matrix
mart = ensembl)
saveRDS(cell_id_conversion, conversion_stash)
}
#Merge the new identifier
normalized_counts_annot <- merge(cell_id_conversion, normalized_counts, by.x = 1, by.y = 0, all.y = TRUE)
kable(normalized_counts_annot[1:5,1:5],type = "html")
| ensembl_gene_id | hgnc_symbol | iPSC.A | iPSC.B | hASTRO.A |
|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 49.9718769 | 49.3636305 | 47.307659 |
| ENSG00000000419 | DPM1 | 68.2095759 | 40.3461140 | 35.223430 |
| ENSG00000000457 | SCYL3 | 5.3520520 | 6.4867941 | 9.453908 |
| ENSG00000000460 | C1orf112 | 33.0721909 | 15.1552454 | 6.251778 |
| ENSG00000000971 | CFH | 0.0290872 | 0.0290888 | 19.517745 |
#Number of identifier are missing
ensembl_id_missing_gene <- normalized_counts_annot$ensembl_gene_id[
which(is.na(normalized_counts_annot$hgnc_symbol))]
length(ensembl_id_missing_gene)
## [1] 489
#Yes. (489/16993) = 0.029. About 2.9% of my dataset miss identifiers.
#Collect all the duplicate of hgnc_symbol
hugoDuplicated <- normalized_counts_annot[duplicated(normalized_counts_annot$hgnc_symbol), 1:2]
#Collect all the non empty string for hgnc_synbol
hugoEmptyDuplicated <- hugoDuplicated[!(nchar(hugoDuplicated$hgnc_symbol) == 0),]
#Collect all the non NA for hgnc_synbol
nohugoEmptyNADuplicated <- hugoEmptyDuplicated[!is.na(hugoEmptyDuplicated$hgnc_symbol),]
#Only remove gene with NA in hgnc_symbol
finalized_dataset <- normalized_counts_annot
head(finalized_dataset)
## ensembl_gene_id hgnc_symbol iPSC.A iPSC.B hASTRO.A hASTRO.B
## 1 ENSG00000000003 TSPAN6 49.97187691 49.36363053 47.307659 36.538029
## 2 ENSG00000000419 DPM1 68.20957588 40.34611405 35.223430 32.577734
## 3 ENSG00000000457 SCYL3 5.35205201 6.48679411 9.453908 8.850399
## 4 ENSG00000000460 C1orf112 33.07219095 15.15524544 6.251778 7.955028
## 5 ENSG00000000971 CFH 0.02908724 0.02908876 19.517745 13.017319
## 6 ENSG00000001036 FUCA2 35.95182763 58.84656721 54.741175 57.407064
## hASTRO.C hASTRO.D iPSC.C iPSC.D
## 1 38.334551 48.529023 41.0380737 68.83296439
## 2 36.694033 31.360565 60.4500229 36.19971961
## 3 10.914466 9.546098 7.1278251 6.71686095
## 4 7.265151 8.965347 24.9322222 25.53001572
## 5 46.235820 34.699885 0.1213247 0.05944125
## 6 85.039090 88.056402 37.7319762 60.03565979
#Normalize the data
finalized_dataset %>%
mutate_at(vars(-ensembl_gene_id, -hgnc_symbol), funs(.+1)) %>%
mutate_at(vars(-ensembl_gene_id, -hgnc_symbol), funs(log2(.))) -> new_normalized_count_data
#Remove any duplicate ensembl_gene_id
new_normalized_count_data <- new_normalized_count_data[!duplicated(new_normalized_count_data$ensembl_gene_id),]
#Check
kable(new_normalized_count_data[1:10,], type="html")%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| ensembl_gene_id | hgnc_symbol | iPSC.A | iPSC.B | hASTRO.A | hASTRO.B | hASTRO.C | hASTRO.D | iPSC.C | iPSC.D |
|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 5.6716296 | 5.6543104 | 5.594180 | 5.230281 | 5.297725 | 5.630202 | 5.3936247 | 6.1258363 |
| ENSG00000000419 | DPM1 | 6.1128998 | 5.3696798 | 5.178851 | 5.069433 | 5.236264 | 5.016165 | 5.9413416 | 5.2172198 |
| ENSG00000000457 | SCYL3 | 2.6672227 | 2.9043481 | 3.385970 | 3.300182 | 3.574642 | 3.398637 | 3.0228694 | 2.9480141 |
| ENSG00000000460 | C1orf112 | 5.0905228 | 4.0139308 | 2.858335 | 3.162698 | 3.047041 | 3.316920 | 4.6966739 | 4.7295536 |
| ENSG00000000971 | CFH | 0.0413653 | 0.0413674 | 4.358800 | 3.809138 | 5.561809 | 5.157847 | 0.1652041 | 0.0833036 |
| ENSG00000001036 | FUCA2 | 5.2075738 | 5.9031966 | 5.800671 | 5.868071 | 6.426920 | 6.476647 | 5.2754532 | 5.9315805 |
| ENSG00000001084 | GCLC | 6.0318414 | 5.2893284 | 4.361478 | 4.632048 | 4.755631 | 4.598725 | 5.7067888 | 5.6577783 |
| ENSG00000001167 | NFYA | 5.7438846 | 5.3461439 | 5.036974 | 4.741960 | 5.222099 | 5.389108 | 5.8524971 | 5.7514902 |
| ENSG00000001460 | STPG1 | 2.9700256 | 3.4449558 | 3.563921 | 3.891790 | 3.691407 | 3.274257 | 3.3576937 | 3.8152756 |
| ENSG00000001461 | NIPAL3 | 3.5596589 | 3.7051759 | 6.746265 | 7.702369 | 5.860883 | 5.709398 | 3.7991991 | 3.9819321 |
#Numerical matrix for heatmap
heatmap_matrix <- new_normalized_count_data[,3:ncol(new_normalized_count_data)]
#Assign rowname by ensemble_gene_id
rownames(heatmap_matrix) <- make.names(new_normalized_count_data$ensembl_gene_id, unique=TRUE)
#Assign colnames by gene names
colnames(heatmap_matrix) <- colnames(new_normalized_count_data[,3:ncol(new_normalized_count_data)])
#Model design based on samples cell type
model_design <- model.matrix(~samples$cell_type)
#Show the table
kable(model_design, type="html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| (Intercept) | samples$cell_typeiPSC |
|---|---|
| 1 | 1 |
| 1 | 1 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
#Create matrix that contain infomration of normalized data from column 3 to end
expressionMatrix <- as.matrix(new_normalized_count_data[,3:ncol(new_normalized_count_data)])
#Ensemble gene as row name
rownames(expressionMatrix) <- new_normalized_count_data$ensembl_gene_id
#Set column name based on normalized data
colnames(expressionMatrix) <- colnames(new_normalized_count_data)[3:ncol(new_normalized_count_data)]
#Create minimal set by using assayData, which must contain a matrix expression with rows and representing features and columnes representing samples
minimalSet <- ExpressionSet(assayData=expressionMatrix)
#Fit data to the above model
fit <- lmFit(minimalSet, model_design)
#Apply empitical bayes to fit data
ebayes_fit <- eBayes(fit, trend=TRUE)
#Collect the data that adjust method, which is BH, applied
topfit <- topTable(ebayes_fit,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#Merge hgnc names to topfit table
output_hits <- merge(new_normalized_count_data[,1:2],
topfit,
by.y = 0, by.x = 1,
all.y = TRUE)
#Sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
#Table
kable(output_hits[1:10,], type="html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 11052 | ENSG00000174099 | MSRB3 | -5.728253 | 4.526069 | -31.58745 | 0 | 2.9e-06 | 13.74448 |
| 9937 | ENSG00000167123 | CERCAM | -4.750546 | 5.418317 | -29.91980 | 0 | 2.9e-06 | 13.39331 |
| 8635 | ENSG00000159216 | RUNX1 | -6.816623 | 4.104574 | -28.95348 | 0 | 2.9e-06 | 13.17504 |
| 4058 | ENSG00000117228 | GBP1 | -5.720482 | 3.007139 | -27.50517 | 0 | 2.9e-06 | 12.82560 |
| 13558 | ENSG00000198796 | ALPK2 | -7.574717 | 4.591906 | -27.04401 | 0 | 2.9e-06 | 12.70833 |
| 4224 | ENSG00000119242 | CCDC92 | -5.380433 | 3.097919 | -26.99704 | 0 | 2.9e-06 | 12.69621 |
| 9741 | ENSG00000166147 | FBN1 | -6.899814 | 7.454708 | -26.52532 | 0 | 2.9e-06 | 12.57275 |
| 9042 | ENSG00000162849 | KIF26B | -5.353285 | 3.623001 | -26.33549 | 0 | 2.9e-06 | 12.52213 |
| 3169 | ENSG00000109099 | PMP22 | -7.085657 | 4.996938 | -25.57108 | 0 | 3.2e-06 | 12.31267 |
| 7668 | ENSG00000148516 | ZEB1 | -7.011617 | 3.941617 | -25.26380 | 0 | 3.2e-06 | 12.22585 |
length(which(output_hits$P.Value < 0.05))
## [1] 10741
length(which(output_hits$adj.P.Val < 0.05))
## [1] 9869
#Create a design matrix
model_design_pat <- model.matrix(
~samples$patient + samples$cell_type) #Design matrix based on cell type and sample group
#Check the matrix
table_model_design_pat <- as.data.frame(model_design_pat)
table_model_design_pat
## (Intercept) samples$patientB samples$patientC samples$patientD
## 1 1 0 0 0
## 2 1 1 0 0
## 3 1 0 0 0
## 4 1 1 0 0
## 5 1 0 1 0
## 6 1 0 0 1
## 7 1 0 1 0
## 8 1 0 0 1
## samples$cell_typeiPSC
## 1 1
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
## 7 1
## 8 1
fit_pat <- lmFit(minimalSet, model_design_pat)
#The parameter trend = TRUE is specific to RND-seq data
ebayes_fit_pat <- eBayes(fit_pat, trend = TRUE)
topfit_pat <- topTable(ebayes_fit_pat,
coef = ncol(model_design_pat),
adjust.method = "BH",
number = nrow(expressionMatrix))
#Merge hgnc names to topfit table
output_hits_pat <- merge(new_normalized_count_data[, 1:2],
topfit_pat, by.y=0, by.x=1, all.y=TRUE)
#Sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]
#Check
kable(output_hits_pat[1:10,], type="html")%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 190 | ENSG00000008441 | NFIX | -7.286118 | 4.519172 | -38.40576 | 0e+00 | 0.0001382 | 9.189616 |
| 8635 | ENSG00000159216 | RUNX1 | -6.816623 | 4.104574 | -37.51832 | 0e+00 | 0.0001382 | 9.113968 |
| 4272 | ENSG00000119681 | LTBP2 | -7.623419 | 6.060059 | -37.04555 | 0e+00 | 0.0001382 | 9.072156 |
| 2506 | ENSG00000103888 | KIAA1199 | -6.094511 | 6.036422 | -36.64151 | 0e+00 | 0.0001382 | 9.035557 |
| 509 | ENSG00000041982 | TNC | -6.931249 | 7.216754 | -32.52452 | 1e-07 | 0.0002231 | 8.611453 |
| 9741 | ENSG00000166147 | FBN1 | -6.899814 | 7.454708 | -31.53934 | 1e-07 | 0.0002231 | 8.494229 |
| 399 | ENSG00000026508 | CD44 | -5.801883 | 6.891218 | -29.80103 | 1e-07 | 0.0002300 | 8.269947 |
| 11052 | ENSG00000174099 | MSRB3 | -5.728253 | 4.526069 | -29.45561 | 1e-07 | 0.0002300 | 8.222518 |
| 11294 | ENSG00000176046 | NUPR1 | -6.668329 | 3.755843 | -29.29102 | 1e-07 | 0.0002300 | 8.199566 |
| 9477 | ENSG00000164694 | FNDC1 | -6.484058 | 4.635066 | -27.55935 | 2e-07 | 0.0002938 | 7.943458 |
length(which(output_hits_pat$P.Value < 0.05))
## [1] 10263
length(which(output_hits_pat$adj.P.Value < 0.05))
## [1] 0
#Set up edgeR objects
edge_obj = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
#Estimate Dispersion - our model design
edge_obj <- estimateDisp(edge_obj, model_design_pat)
#Fit the model
fit <- glmQLFit(edge_obj, model_design_pat)
qlf.iPSC_vs_hASTRO <- glmQLFTest(fit, coef='samples$cell_typeiPSC')
#plot the table
topTags(qlf.iPSC_vs_hASTRO)
## Coefficient: samples$cell_typeiPSC
## logFC logCPM F PValue FDR
## ENSG00000147883 -6.782249 5.561304 1157.7616 8.891990e-08 0.0009823517
## ENSG00000176046 -8.428381 5.999789 1054.9682 1.156184e-07 0.0009823517
## ENSG00000119681 -7.686570 8.873231 762.0897 2.893571e-07 0.0012926821
## ENSG00000117226 -8.850286 4.040764 708.0475 3.559903e-07 0.0012926821
## ENSG00000174099 -5.977585 6.273533 682.7683 3.943740e-07 0.0012926821
## ENSG00000000971 -8.535409 3.672329 575.9320 6.366409e-07 0.0012926821
## ENSG00000164106 -10.275179 5.457191 557.4834 6.976773e-07 0.0012926821
## ENSG00000162849 -6.110159 5.183740 533.9295 7.876935e-07 0.0012926821
## ENSG00000159216 -7.965852 6.387569 533.5658 7.892034e-07 0.0012926821
## ENSG00000214336 8.112168 3.740764 530.2342 8.032190e-07 0.0012926821
qlf_output_hits <- topTags(qlf.iPSC_vs_hASTRO,
sort.by = "PValue",
n = nrow(new_normalized_count_data),
adjust.method = "BH") #Use BH as adjust method for this assignment
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 10217
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 9042
#Calculate normalization factors
edge_objs <- calcNormFactors(edge_obj)
#Dit model
fits <- glmQLFit(edge_objs, model_design_pat)
#calculate differential expression
qlf.hiPSC_vs_hASTRO <- glmQLFTest(fits, coef='samples$cell_typeiPSC')
qlf_output_hit <- topTags(qlf.hiPSC_vs_hASTRO,sort.by = "PValue",
n = nrow(new_normalized_count_data),
adjust.method = "BH")
length(which(qlf_output_hit$table$PValue < 0.05))
## [1] 9977
length(which(qlf_output_hit$table$FDR < 0.05))
## [1] 8684
topTags(qlf.hiPSC_vs_hASTRO)
## Coefficient: samples$cell_typeiPSC
## logFC logCPM F PValue FDR
## ENSG00000147883 -7.072359 5.561304 1356.5603 6.228246e-08 0.0006348592
## ENSG00000176046 -8.724601 5.999789 1271.3433 7.472008e-08 0.0006348592
## ENSG00000117226 -9.127679 4.040764 805.1695 2.688471e-07 0.0011341151
## ENSG00000119681 -7.979054 8.873231 729.4311 3.544762e-07 0.0011341151
## ENSG00000174099 -6.269017 6.273533 663.8840 4.612822e-07 0.0011341151
## ENSG00000000971 -8.831714 3.672329 651.8760 4.854351e-07 0.0011341151
## ENSG00000162849 -6.402484 5.183740 640.2579 5.104639e-07 0.0011341151
## ENSG00000159216 -8.260542 6.387569 610.8815 5.820706e-07 0.0011341151
## ENSG00000164106 -10.535707 5.457191 604.0468 6.006612e-07 0.0011341151
## ENSG00000103888 -6.315549 8.034202 554.7927 7.617441e-07 0.0011660431
length(which(qlf_output_hit$table$PValue < 0.05 & qlf_output_hit$table$logFC > 0))
## [1] 4930
length(which(qlf_output_hit$table$PValue < 0.05 & qlf_output_hit$table$logFC < 0))
## [1] 5047
#merge gene names with the top hits
qlf_output_hits_withgn <- merge(new_normalized_count_data[,1:2],qlf_output_hit, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
#P values from data that apply multiple hypothesis testing
qlf_cell_model_pvalues <- data.frame(ensembl_id = rownames(qlf_output_hits$table),
qlf_cell_pvalue=qlf_output_hits$table$PValue)
#downregulated when differentiate iPSC to astrocyte
downregulated_gene <- qlf_output_hits_withgn[which(qlf_output_hits_withgn$PValue < 0.05 & qlf_output_hits_withgn$logFC > 0),]
upregulated_gene <- qlf_output_hits_withgn[which(qlf_output_hits_withgn$PValue < 0.05 & qlf_output_hits_withgn$logFC < 0),]
#Plot for gene of interest CD44
ensembl_of_interest <- new_normalized_count_data$ensembl_gene_id[
which(new_normalized_count_data$hgnc_symbol == "CD44")]
#Downregulated gene ensembl_id
downregulated_gene <- data.frame(ensembl_id = downregulated_gene$ensembl_gene_id)
#Upregualted gene ensembl_id
upregulated_gene <- data.frame(ensembl_id = upregulated_gene$ensembl_gene_id)
qlf_cell_model_pvalues$colour <- "grey" #Grey for all genes
qlf_cell_model_pvalues$colour[upregulated_gene$ensembl_id] <- "red" #Red for upregulated
qlf_cell_model_pvalues$colour[downregulated_gene$ensembl_id] <- "blue" #Blue for downregulated
qlf_cell_model_pvalues$colour[qlf_cell_model_pvalues$ensembl_id==ensembl_of_interest] <- "orange" #Orange for gene of interest
volcanoplot(ebayes_fit,
coef = ncol(ebayes_fit),
ylab = "M-ratio log expression",
cex = ifelse(qlf_cell_model_pvalues$colour == "orange", 2, 0.3),
col = qlf_cell_model_pvalues$colour,
main = "Upregulated genes vs Downregulated genes")
#Set top hit
top_hits <- rownames(qlf_output_hits$table)[output_hits_pat$P.Value<0.05]
#Set heatmap matrix tophits
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
#Sorth the column by cell type
#Organize by cell type
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits), pattern = "^iPSC"),
grep(colnames(heatmap_matrix_tophits), pattern = "^hASTRO"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), #if no negative value in heatmap matrix,
c( "white", "red")) #use white and red color
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red")) #blue, white, red if heatmap matrix contain negative value
}
#Plot
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col = heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_title = "Heatmap for top hits")
current_heatmap
downregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05 #downregulated when differentiate iPSC to astrocyte
& qlf_output_hits_withgn$logFC > 0)]
upregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)] #upregulated when differentiate iPSC to astrocyte
write.table(x=downregulated_genes,
file="cell_downregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE) #save as downregulated file
write.table(x=upregulated_genes,
file="cell_upregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE) #save as upregulated file
#Check to make sure upregulated genes list contain the astrocyte specific gene
upregulated_genes_dataframe <- as.data.frame(upregulated_genes)
#List of gene that expressed in astrocyte
list_upregulated <- c(upregulated_genes_dataframe[upregulated_genes_dataframe == "CD44"],
upregulated_genes_dataframe[upregulated_genes_dataframe == "SOX9"],
upregulated_genes_dataframe[upregulated_genes_dataframe == "DIO2"])
#Remove NA
list_upregulated <- list_upregulated[!is.na(list_upregulated)]
list_upregulated
## [1] "CD44" "SOX9" "DIO2"
#Check to make sure upregulated genes list contain the astrocyte specific gene
downregulated_genes_dataframe <- as.data.frame(downregulated_genes)
#List of gene that expressed in iPSC
list_downregulated <- c(downregulated_genes_dataframe[downregulated_genes_dataframe == "MYC"],
downregulated_genes_dataframe[downregulated_genes_dataframe == "NANOG"],
downregulated_genes_dataframe[downregulated_genes_dataframe == "SALL4"])
#Remove NA
list_downregulated <- list_downregulated[!is.na(list_downregulated)]
list_downregulated
## [1] "MYC" "NANOG" "SALL4"
I use Gene Set Enrichment Analysis (GSEA) becuase it is currently most used method for non-thresholded analysis. GSEA has Java GUI that serve as easy tool for researchers who want to do non-threshold analysis without writing down codes.The version of GSEA that I use is GSEA 4.0.3 Java GUI.
I use “Human_GOBP_AllPathways_no_GO_iea_February_01_2020_symbol.gmt” because it includes human geneset of MsigBD, Institue of BIoinformatics, GO, Reactome, and Panther. link
# Download February GMT file from bader lab website
gmt_url = "http://download.baderlab.org/EM_Genesets/February_01_2020/Human/symbol/"
#list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
#get the gmt that has all the pathways and does not include terms inferred from electronic annotations(IEA)
#start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
working_dir <- "/home"
dest_gmt_file <- file.path(working_dir,paste(gmt_file, sep=""))
download.file(
paste(gmt_url,gmt_file,sep=""),
destfile=dest_gmt_file
)
#Compute ranks
qlf_output_hits_withgn[,"rank"] <- log(qlf_output_hits_withgn$PValue, base = 10)*
sign(qlf_output_hits_withgn$logFC)
#Sort table by ranks
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank, decreasing = TRUE),]
#Remove any gene that don't have hgnc_symbol or label as NA
qlf_output_hits_withgns <- qlf_output_hits_withgn[!(nchar(qlf_output_hits_withgn$hgnc_symbol) == 0),]
qlf_output_hits_withgns <- qlf_output_hits_withgns[!is.na(qlf_output_hits_withgns$hgnc_symbol),]
#Create dataframe that only contain hgnc symbol and rank
qlf_output_hits_withgns <- qlf_output_hits_withgns[,c("hgnc_symbol", "rank")]
#Table
kable(qlf_output_hits_withgns [1:5,], type="html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| hgnc_symbol | rank | |
|---|---|---|
| 7603 | CDKN2B | 7.205634 |
| 11294 | NUPR1 | 7.126563 |
| 4057 | GBP3 | 6.570495 |
| 4272 | LTBP2 | 6.450413 |
| 11052 | MSRB3 | 6.336033 |
#Save as file for GSEA
write.table(qlf_output_hits_withgns,file="cell_expression.rnk",quote=F,sep="\t",row.names=F, col.names = F)
###Summary of Enrichment### This is the summary of enrichment that reached statistical significance in the p-value and FDR
As expected, pathways that related to characteristic of astrocyte detected as upregulated pathways. The first four upregulated pathways are: - Epithelial to mesenchymal transition - Elastic Fibre Formation - Molecules associated with elastic fibres - TNFA singlaing via NFKB
Epithelial to mesenchymal transition should be upregulated since iPSC need to go under differentiation to become astrocyte. Therefore, need to downregulated the epithelial characteristic (iPSC) and upregulate mesenchymal characteristic (astrocyte). This pathway commonly upregulated during embryogenesis and organ development (Kalluri and Weinberg 2009). Synthesis of elastic fibres and associated molecules, such as fibrillin-1 and fibrillin-2, are commonly produced in astrocyte since elastic fibers are one of major components of the extracellular matrix of the nerve head (Pena, Mello, and Hernandez 2000). TNF is cell singlaing protein involve in systemic inflammation and one of cytokines that make up the acute phase reaction. TNF produced in neuron and astrocyte. NFK is transcription regulator that regulate TNF. TNF must be regulated to prevent variety of human disease (Gahring et al. 1996). Most of the pathways that upregulated in GSEA are somewhat similar compare to upregulated pathways in g:profilers. However, some of pathways, such as UV response and positive cartilage regualtion, are not detect in g:profiler while detected in GSEA.
As expected, pathways that related to characteristic of iPSC detected as downregulated pathways The first four downregulated pathways are: - E2F upregulation - MYC uprgulation - DNA replication - DNA dependent DNA replication
E2F is transcription factors that regulate genes that important in cell profileration, specifically progression through G1 and into S-phase. E2F is known replication factors during embryogenesis, which indicates that should expressed in iPSC since iPSC is characteristically similar to embryonic stem cell (Myster, Bonnette, and Duronio 2000). MYC is common transcription factors that used to differentiate adult cell to iPSC (Takahashi et al. 2008). iPSC is the cell that well known for high replication rate because of its pluripotency and differentiation. To maintain pluripotencym, expression of DNA repair and replication must be maintained (Liu et al. 2017). Most of the pathways that down regulated in GSEA are similar to list of down regulated pathways in g:profiler and not detect any differences. This might happen because of unique characteristic of iPSC that is not detected in other type of cell.
Comparison between result of g:profielr and GSEA is not straight forward because of way that g;profiler and GSEA process to get result is different. g;profiler use thresholded list to ask any gene sets that are enriched or depleted in list. It uses fisher exact test and hypergeometri test to assess. However, GSEA uses non-thresholded list to ask any gene sets that are ranked high or low in list. It uses KS test and linear model.
Figure 1. Enrichment map after autoannotation
Figure 2. Subnetwork of upregulated pathways. It divide into 6. A is nodes that related to extracellular matrix, B is node about ebola virus, C is nodes related to regulation, D is nodes related to immunity, E is nodes related to system within cells, and F is node related to junction. Yellow nodes as selected nodes, first four pathways on upreguatled pahtways.
According to original paper, author said that astrocyte should have characteristic of capacity to produce ATP and propagate inercellular Ca2+ (Domenico et al. 2019). I cannot find any pathway that related to produce ATP in subnetwork.This might happen because the pathway that related to capacity to produce ATP is not significant enough. However, I dectect pahtway that propagate intercellular Ca2+ waves, which is platelet cytosolic ca2+ in section E of figure 2. Therefore, enrichment result partly support conclusions discussed in the original paper. There are similarity between erichment result from GSEA and g;profiler result as both shows junction pathways and extracellular matrix pathways as major pathways that expressed in astrocyte. However, unlike enrichment result from GSEA, g:profiler does not shows immunity pathways, ebola virus pathways, and demannosylation pathways as upregulated pathways in astrocyte. This might happen do to different starting data (ranked geneset for GSEA and list of up and down regulated geneset for g:profiler) and different methods for calculation.
There are many types of nodes, but most of nodes are related to either extracellular matrix and immunity.
Figure 3. Subnetwork of downregulated pathways. It divide into 5 sections. A is nodes that related to DNA manipulation. B is nodes that related to export riboculeoprotein, C is nodes that related to g2m checkpoint, D is nodes that related to myc transcription factors expression, and E related to RNA expression and manipulation. Yellow nodes as selected nodes, first four pathways on downregulated pahtways.
There are many similarity between result from g:profiler and GSEA. For example, both results show DNA replicaiton, recombination, repair, ard RNA processing as most common downregulated pathways. However, g:profiler did not shows specific pathways, such as pathways related to fanconi anemia, which related to DNA repair. Since characteristic of iPSC is unique, unlike upregulated pathways in astrocyte, downregulated pathways, which is pathways related to iPSC, are mostly similar even with different methods.
Figure 4. Erichment map after collapse.
Figure 5. subnetwork of collapse network.
The theme of down regulated network (left side, blue) are cell cycle, specifically regulation of cell cycle. The theme of upreguatled network are production extracellular matrix. Original paper did not mention about extracellular matrix while mention that astrocyte should have high expression that related to intercellular Ca2+ waves. There is one collapsed node that related to intercellular ca2+ wave (platelet cytocolic ca2+) (Domenico et al. 2019). Therefore, enrichment results not able to fully support mechanism discussed in the original paper. However, from otehr papers, the production of extracellular matrix is one of characteristic of astrocyte that does not present in iPSC since astrocyte release wide range of extracellular matrix molecule that is important for glial development. Ebola virus pathways is the novel pathway detected in collapsed upregulated network (Domenico et al. 2019). I cannot find any papers that shows relationship between astrocyte and ebola virus. This pathways might assigned as upregulated pathways because author from original paper use virus to transfer specific genetic material to iPSC to differentiate into astrocyte. Downregualted pathways are the pathways that commonly foudn in iPSC, cell cycle (Domenico et al. 2019). Also, c-myc with other trasncription factors used to differentiate fibroblast to iPSC, which shows as downregulated pathways (Takahashi et al. 2008). These conclude that author in original paper successfully differentiate iPSC to astrocyte (Wiese, Karus, and Faissner 2012).
##Post Analysis ##
Figure 6. Erichment map with signature analysis.
Figure 7. Part of erichment map with signature analysis.
I choose ATF3 as transcription factors for post analysis to my main network because ATF3 is driver of astrocyte differentiation from nerual precursor cells. ATF3 and NFIA transcription factors bind to specific binding site that activate eA genes and IA genes stage of early astrogligenesis. Then, RUNx2 transcription factor can bind and prevent eA genes to become mature astrocyte. I add signature gene sets with parameter of Mann Whitney (one-sided greater) with cutoff of 0.05. I use Mann Whiteney one sided greater to look at the relationship of ATF3 transcription factors to one of upregulated pathways. I detect highly engaged in apoptosis pathways, negative epithelial prolierataion pathway, tnfa signalling pathway, epithelial mesenchymal transition pathway, and hypoxia pathways. ATF3 might part of pathways that cause epithelial to mesenchymal transition pathways since ATF3 is require to drive astrocyte differentiation, but not maturation to astrocyte. ATF3 is also weakly related to downregulated pathways, such as metaphase sister chromatid and rna splicing, because at early stage, cell still has little bit stem cell characteristic and need to keep ability to high cell division rate(Tiwari et al. 2018).
Domenico, Angelique di, Giulia Carola, Carles Calatayud, Meritxell Pons-Espinal, Juan Pablo Muñoz, Yvonne Richaud-Patin, Irene Fernandez-Carasa, et al. 2019. “Patient-Specific iPSC-Derived Astrocytes Contribute to Non-Cell-Autonomous Neurodegeneration in Parkinson’s Disease.” Stem Cell Reports 12 (2). Elsevier: 213–29.
Gahring, Lorise C, Noel G Carlson, Rachel A Kulmer, and Scott W Rogers. 1996. “Neuronal Expression of Tumor Necrosis Factor Alpha in the Ovouji Fme Brain.” Neuroimmunomodulation 3 (5). Karger Publishers: 289–303.
Kalluri, Raghu, and Robert A Weinberg. 2009. “The Basics of Epithelial-Mesenchymal Transition.” The Journal of Clinical Investigation 119 (6). Am Soc Clin Investig: 1420–8.
Liu, Kai, Jian Mao, Lipu Song, Anran Fan, Sheng Zhang, Jianyu Wang, Nana Fan, et al. 2017. “DNA Repair and Replication Links to Pluripotency and Differentiation Capacity of Pig iPS Cells.” PloS One 12 (3). Public Library of Science: e0173047.
Myster, Denise L, Peter C Bonnette, and Robert J Duronio. 2000. “A Role for the Dp Subunit of the E2f Transcription Factor in Axis Determination During Drosophila Oogenesis.” Development 127 (15). The Company of Biologists Ltd: 3249–61.
Pena, Janethe DO, Paulo AA Mello, and M Rosario Hernandez. 2000. “Synthesis of Elastic Microfibrillar Components Fibrillin-1 and Fibrillin-2 by Human Optic Nerve Head Astrocytes in Situ and in Vitro.” Experimental Eye Research 70 (5). Elsevier: 589–601.
Takahashi, Kazutoshi, Koji Tanabe, Mari Ohnuki, Megumi Narita, Tomoko Ichisaka, Kiichiro Tomoda, and Shinya Yamanaka. 2008. “Induction of Pluripotent Stem Cells from Adult Human Fibroblasts by Defined Factors.” Obstetrical & Gynecological Survey 63 (3). LWW: 153.
Tiwari, Neha, Abhijeet Pataskar, Sophie Péron, Sudhir Thakurela, Sanjeeb Kumar Sahu, Marı'a Figueres-Oñate, Nicolás Marichal, Laura López-Mascaraque, Vijay K Tiwari, and Benedikt Berninger. 2018. “Stage-Specific Transcription Factors Drive Astrogliogenesis by Remodeling Gene Regulatory Landscapes.” Cell Stem Cell 23 (4). Elsevier: 557–71.
Wiese, Stefan, Michael Karus, and Andreas Faissner. 2012. “Astrocytes as a Source for Extracellular Matrix Molecules and Cytokines.” Frontiers in Pharmacology 3. Frontiers: 120.
n.d. ExpressionSet. https://www.rdocumentation.org/packages/Biobase/versions/2.32.0/topics/ExpressionSet.
n.d. Banjamini-Hochberg-Procedure. https://www.statisticshowto.datasciencecentral.com/benjamini-hochberg-procedure/.